ABSTRACT 

Because of misconceptions regarding appropriate 
measurement strategies, it is necessary to draw distinctions between 
two major measurement methodologies, norm-referenced and 
criterion-referenced measurement, as they relate to determining basic 
academic capabilities. Norm-referenced measures are used to ascertain 
an individual's performance in relationship to the performance of 
other individuals on the same measuring device. Criterion-referenced 
measures are used to ascertain an individual's status with respect to 
some criterion, that is, an explicitly described type of learner 
competence. Because of the wide use of norm-referenced standardized 
achievement tests, many assume that they are the only instruments 
that should be used to find out how well a school is working or a 
pupil is learning. But typical standardized tests are unsuitable for 
these purposes because of problems with their interpretability and 
their psychometric properties. Criterion-referenced tests remedy some 
of these weaknesses because they can: (1) be more accurately 
interpretable; (2) detect the effects of good instruction; and (3) 
allow us to make more accurate diagnoses of individual learners' 
capabilities. If sufficient care is taken to support the development 
of high quality criterion-referenced measures, legislation to 
distribute federal funds on the basis of educational deficiencies 
rather than census determiners appears to be sound. (Author/KH) 

You can't measure mileage with a tablespoon. But everyone 
knows that, so no one tries to. After all, tablespoons were 
designed to serve a clearly identifiable measurement function, thus 
they are never employed for assessing such things as distance, 
sound and heat. Significant problems arise, however, when the 
mission of a measuring instrument is not so patently obvious, hence 
it can be mistakenly used in situations where it yields apparently 
respectable but misleading data. 

For there are seductive dangers associated with the possession 
of data. We live in an increasingly evidence-conscious society, 
and the person who can trot forth a sufficiently impressive array 
of data often becomes the winner in policy disputes. After all, 
our data-devotee will claim that he has the facts and the other side 
operates only on intuition. But, quite obviously, the quality of 
a data-based argument or decision depends on the quality of the data. 
Injudicious selection of measuring instruments is likely to yield 
indefensible data. Unfortunately, in the field of education we 
are currently suffering from the afflictions of a markedly mis- 
applied measurement tradition. 

Not only with respect to the particular bill currently under 
consideration by this Committee, but because misperceptions regar- 
ding appropriate measurement strategies may impinge upon one's 
appraisal of comparable legislation, it is necessary to draw distinc- 
tions between two major measurement methodologies as they relate to 
determining the basic academic capabilities of the nation's youth. 
More specifically, differences will be identified between a norm- 
referenced measurement approach and a criterion-referenced measure- 
ment approach. The purposes of these two assessment strategies 
will be examined along with illustrations of how, if the wrong type 
of approach is utilized, misleading data will result. 



The Basic Distinction 

Norm-referenced measures are used to ascertain an individual's 
performance in relationship to the performance of other individuals 
on the same measuring device. The meaningfulness of an individual 
score emerges from the comparison. It is because the individual 
is compared with some normative group that such measures are descri- 
bed as norm-referenced. Most standardized tests of achievement or 
intellectual ability used in this country can be classified as norm- 
referenced measures. Such tests are designed to yield a series of 
relative performance descriptions, that is, relative to the norm 
group. It is expected that we will be able to distinguish between 
Mary who scores at the 65th percentile (of the norm group) and 
Harry who scores at the 48th percentile (of the norm group). 

Criterion-referenced measures are used to ascertain an individ- 
ual's status with respect to some criterion, that is, an explicitly 
described type of learner competence. It is because the individ- 
ual's performance is compared with an established criterion, rather 
than the performance of other individuals, that these measures are 
described as criterion-referenced. The meaningfulness of an indiv- 
idual score is not dependent on comparisons with other individuals 
who took the test. We want to know what an individual can do, not 
how he stands in comparison to others. For example, the dog owner 
who wants to keep his dog in the back yard may give the dog a fence- 
jumping test. The owner wants to find out how high the dog can 
jump so that the owner can build a fence high enough to keep the 
dog in the yard. How the dog compares with other dogs is irrelevant. 
Another example of a criterion-referenced test would be the Red 
Cross Senior Lifesaving Test, where an individual must display cer- 
tain swimming skills to pass the examination irrespective of how 
well others perform on the test. Merely because a group of weak 
swimmers sign up to take the lifesaving test on a given occasion 
would not mean that the best performance of that group would neces- 
sarily be high enough to pass the test. 
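
The lifesaving example can be sketched in a few lines of Python. This is 
only an illustration: the scores, the passing standard of 80, and the 
function names are assumptions made for the sketch, not figures from any 
actual test. It shows that the best swimmer in a weak group can rank at 
the top of that group and still fall short of the fixed criterion. 

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def meets_criterion(score, cutoff):
    """True when the score reaches a fixed, pre-stated standard."""
    return score >= cutoff


# Hypothetical scores (out of 100) for a group of weak swimmers.
weak_group = [35, 40, 42, 50, 55]
CUTOFF = 80  # assumed passing standard, chosen only for illustration

best = max(weak_group)
# Norm-referenced view: how the best swimmer stands relative to the group.
print(percentile_rank(best, weak_group))  # 80.0
# Criterion-referenced view: whether the required skill level is reached.
print(meets_criterion(best, CUTOFF))      # False
```

The first number changes whenever the comparison group changes; the 
second depends only on the individual's performance and the criterion. 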

Since norm-referenced measures are devised to facilitate com- 
parisons among individuals, it is not surprising that their primary 
purpose is to make decisions about individuals. Which pupils should 
be counseled to pursue higher education? Which pupils should be 
advised to attain vocational skills? These are the kinds of ques- 
tions one seeks to answer through the use of norm-referenced meas- 
ures, for many decisions regarding an individual can best be made 
by knowing more about the "competition," that is, by knowing how 
other, comparable individuals perform. 

Although criterion-referenced tests are also used to make 
decisions about individuals, there is usually a difference in the 
context in which such decisions are made. Generally, a norm-refer- 
enced measure is employed where a degree of selectivity is required 
by the situation. For example, when there are only limited openings 
in a company's executive training program, the company is anxious 
to identify the best potential trainees. It is critical in such 
situations, therefore, that the measure permit relative comparisons 
among individuals. On the other hand, in situations where one is 
only interested in whether an individual possesses a particular 
competence, and there are no constraints regarding how many indiv- 
iduals can possess that skill, criterion-referenced measures are 
preferable. In this sense, criterion-referenced measures may be 
considered absolute indicators.* 



*For a more detailed treatment of the distinctions between norm- 
referenced and criterion-referenced measurement approaches, see 
Popham, W. J. (Ed.) Criterion-Referenced Measurement: An Introduc- 
tion. Educational Technology Publications, Englewood Cliffs, N.J., 
1971. 



The Misapplied Measurement Tradition 

For many years in our nation we have relied heavily on the 
use of norm-referenced measures. Almost without exception, the 
many standardized achievement tests used throughout the land fit 
the classic norm-referenced measurement model. When these devices 
were used in a fashion consistent with their chief mission, that is, 
to permit comparisons among individual pupils, then appropriate data 
were produced. But when these tests were used for other purposes, 
such as to secure a clear picture of what reading skills a partic- 
ular child possessed, then the resulting data may have typically 
been more misleading than helpful. 

Yet, because these tests have been widely used for so many years 
and because they are produced by reputable commercial publishers (who 
distribute them with a host of sophisticated measurement trappings 
such as technical reliability and validity reports), many educators 
and most citizens assume that standardized achievement tests are 
the only respectable instruments one should use when attempting to 
find out how well our schools are working, or more specifically, 
just how well an individual pupil is learning. 

For purposes such as these, the use of a norm-referenced test 
will often produce spurious data. And the tragedy is that such 
data may be influential in arriving at far-reaching decisions regar- 
ding our nation's educational enterprise. For example, several 
recent reports have focused on extensive analyses of the relative 
contribution of numerous factors to the quality of education. The 
results appear to be disappointing. Teachers don't seem to make 
much of a difference. Financial expenditures don't seem to make 
much of a difference. Indeed, schools themselves don't seem to make 
much of a difference. But much of a difference with respect to what? 
Invariably the index of pupil achievement used in these large scale 
analyses has been performance on norm-referenced tests. And, as 
we shall see, there are characteristics of these measures which 
render them sufficiently inappropriate for such analyses that the 
resulting data and subsequent conclusions should be viewed with 
great suspicion if not complete disdain. 



Deficiencies in Norm-Referenced Tests 

There are two main problems with typical standardized tests, 
which render them unsuitable for widescale use in assessing the 
status of our children's educational attainments. These deficits 
are associated with the interpretability and the psychometric prop- 
erties of norm-referenced tests. 

Interpretability. Most standardized tests are developed by 
commercial test publishers who must design the instruments so that 
they can effectively service an entire nation. Practical economics 
preclude test publishers from developing a separate test for New 
York and another version for North Dakota, even though the instruc- 
tional emphases of these two states may vary considerably. The way 
that test publishers get out of this bind is to develop a very 
general test which, while it may not be perfectly congruent with a 
given school district's curricular preferences, will at least cover 
some of them. But to the extent that a particular district is empha- 
sizing content and skills other than those included in the very 
broad standardized test, a misleading impression of the district's 
effectiveness or an individual child's capabilities may be created 
by the use of such tests. 

Indeed, it is to the advantage of the commercial test publishers 
to keep achievement tests at very general levels, for then educators 
throughout the nation can derive the characteristic Rorschach divi- 
dend; they can usually see what they want to in an ink blot. Thus, 
when certain tests yield subscale scores such as "reading compre- 
hension," it is inordinately difficult to get a precise fix on 
what is meant by that score. Only by dissecting the test itself 
can the user secure a defensible idea of what the instrument is 
measuring. For purposes such as accurately locating our nation's 
educationally disadvantaged youngsters, we need more crisp interpre- 
tations than are afforded by the bulk of norm-referenced tests. 

Just imagine that by employing a standardized achievement test 
we had located a child who scored below the tenth percentile on a 
mathematics achievement test. We know, of course, that we have a 
child who needs help in math. But what kind of help? The typical 
scores on a standardized math achievement test are often given in 
phrases as general as "basic operations" or "geometric relation- 
ships." With such imprecise descriptors it is next to impossible 
to really identify what the learner's weaknesses are, much less 
to correct them. 

Psychometric Properties. As we have seen, the chief purpose 
of norm-referenced tests is to permit comparisons among individuals. 
Because of this, such tests must produce variant scores. In fact, 
the more that pupil scores can be spread out, the better. Test 
items which are answered correctly by most students, since they 
contribute little to total score variance, must be deleted or modi- 
fied. To contribute to total score variance, an ideal item is one 
which is answered correctly by half the people taking the test (pre- 
ferably those who scored highest on the total test) and incorrectly 
by the other half (preferably those who scored lowest on the total 
test). Most standardized tests which have been revised several 
times contain a great many such items since, for purposes of spread- 
ing out those taking the test, these items function effectively. 
But, in general, such test items are most highly correlated with 
native intellectual ability. In other words, as standardized achieve- 
ment tests are revised and refined through the years in order to max- 
imize the variability of pupil scores, they more and more closely 
resemble a classic intelligence test. Thus, norm-referenced tests 
are often quite insensitive to detecting the effects of even high 
quality instruction. 
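
The preference for items answered correctly by half the examinees 
follows from simple arithmetic. For an item scored right or wrong 
(1 or 0), the variance of the item score is p(1 - p), where p is the 
proportion answering correctly. The short Python sketch below, with an 
illustrative function name of our own choosing, tabulates that quantity. 
It covers only the variance of a single item, not its correlation with 
the total score. 

```python
def item_variance(p):
    """Variance of a right/wrong (1/0) item answered correctly by proportion p."""
    return p * (1.0 - p)


# Tabulate the variance for several proportions of correct answers.
for p in (0.1, 0.5, 0.9, 1.0):
    print(p, round(item_variance(p), 2))
```

The variance peaks at 0.25 when p is 0.5 and falls to zero when every 
examinee answers correctly, which is why such items are candidates for 
deletion from a test built to spread pupils out. 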

To illustrate, suppose a teacher attempts to teach an impor- 
tant concept and, prior to instruction, administers a test item 
which almost everyone misses. Yet, after a really fine instruction- 
al job, the same test item is answered correctly by everyone. But, 
because it produces no score variance among students, this kind of 
item would have to be excluded from a standardized achievement test. 
This not only leads to insensitive tests but creates the further 
problem that oft-revised standardized tests many times do not con- 
tain the very test items which deal with the central concepts of a 
field. 



Counteractions by Criterion-Referenced Tests 

Largely in an effort to remedy some of the weaknesses of norm- 
referenced measures, criterion-referenced tests are designed in such 
a way as to (1) be more accurately interpretable, (2) detect the 
effects of good instruction, and (3) allow us to make more accurate 
diagnoses of individual learners' capabilities. 

Defined Pupil Competencies. One of the important ingredients 
of a well devised criterion-referenced test is an explicitly defined 
criterion. Putting it another way, since the whole conception of 
this measurement strategy is based on referencing scores to a cri- 
terion set of learner behaviors, then the behaviors must be des- 
cribed without ambiguity. Most current criterion-referenced measure- 
ment specialists are advocating that a domain of learner behaviors 
be delineated in such a way that from the domain description (often 
called an item form) an almost unlimited number of test items could 
be generated. It must be noted that "test item" should be conceived 
of as representing a wide range of measurement techniques, not 
merely paper and pencil tests. Because of the characteristic accur- 
acy of the criterion descriptions, we have a far better idea of 
what it is that the student can or can't do. This becomes particu- 
larly important when, upon assessing the students, we discover seri- 
ous educational deficiencies. With a typical norm-referenced test 
we would have only a global idea of the general sort of student 
weakness; with a criterion-referenced test the deficits can be pin- 
pointed and thus more readily ameliorated. 

Sensitivity to Instruction. Because criterion-referenced tests 
need not produce considerable score variance, they can consist even 
of items which, after instruction, most learners answer correctly. 
They can retain items which are based on the primary curricular 
emphasis. As a consequence, such tests are characteristically more 
sensitive than norm-referenced tests for purposes of detecting 
instructional effects. 
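
One simple way to express an item's sensitivity to instruction is the 
change in the proportion of learners answering it correctly before and 
after teaching. The Python sketch below is a hedged illustration: the 
response data are invented and the function names are hypothetical. It 
shows an item that nearly everyone misses beforehand and everyone 
answers correctly afterward. 

```python
def proportion_correct(responses):
    """Share of learners answering the item correctly (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def pre_post_gain(pre, post):
    """Change in proportion correct from before to after instruction."""
    return proportion_correct(post) - proportion_correct(pre)


# Hypothetical responses of ten learners to one item.
pre = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # almost everyone misses it
post = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # everyone answers correctly

print(round(pre_post_gain(pre, post), 2))  # 0.9
```

After instruction this item yields no variance among learners, yet its 
gain of 0.9 is about as large as a gain can be. A criterion-referenced 
test can keep such an item; a test built to maximize score spread would 
discard it. 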

Accurate Diagnoses. Because they are more carefully explicated, 
criterion-referenced tests typically provide us with a more fine- 
grained analysis of exactly what the pupil can and can't do. The 
differential skills we hope learners will acquire can be more accur- 
ately portrayed via a well described criterion-referenced test in 
contrast to its often amorphous norm-referenced counterpart. And 
for promoting instructional improvement, accurate diagnosis is an 
indispensable first step. 



What About Teaching to the Test? 

Discussions such as these often lead to the assertion that 
precisely explicated tests will encourage instructors to teach to 
the test, and that such a practice is somehow reprehensible. Con- 
trary to the wide-spread belief that teaching to the test is an 
instructional sin, we must recognize that if the test is truly defen- 
sible, then we should applaud those who can teach pupils to master 
it. The kind of test which will be defensible is not a particular 
set of items, however, but a sample from an almost infinite number 
of items that could be generated from our well described criterion. 
In other words, we should not be teaching to a given set of 10 
double-digit multiplication problems, but instead to any set of 10 
double-digit multiplication problems randomly selected from a well 
defined item pool. Thus the learner acquires mastery of a class 
of skills, not a limited number of items reflected by a particular 
test. This approach is central to proper use of criterion-refer- 
enced testing. 
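
The idea of drawing each test from a well defined item pool can be 
sketched as follows. This is a minimal Python illustration, assuming 
that "double-digit multiplication" means both factors lie between 10 and 
99; the function names and the seed values are ours, chosen only to make 
the example repeatable. 

```python
import random


def double_digit_item(rng):
    """Generate one item from the domain: two factors, each from 10 to 99."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    return (a, b, a * b)  # the two factors and the correct product


def draw_test(n_items, seed=None):
    """Draw a test form of n_items randomly generated items."""
    rng = random.Random(seed)
    return [double_digit_item(rng) for _ in range(n_items)]


# Two forms drawn from the same domain description.
form_a = draw_test(10, seed=1)
form_b = draw_test(10, seed=2)

for a, b, product in form_a:
    print(f"{a} x {b} = ?   (answer: {product})")
```

Under this assumed domain there are 90 x 90 = 8,100 possible ordered 
items, so a learner who has merely memorized the ten answers on one form 
gains little on the next. What pays off is mastery of the skill itself. 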



Spending Money and Measuring Skills 

The general thrust of the legislation currently under consi- 
deration involves the distribution of federal educational funds 
on the basis of measured educational deficiencies rather than census 
determiners. Further, there appears to be a recognition of the im- 
portance of employing appropriate measurement methodology when iden- 
tifying educationally disadvantaged youngsters. Assuming that suf- 
ficient care can be taken to support the development of high quality 
criterion-referenced measures for this purpose, the general scheme 
for targeting federal dollars appears to be sound. For when we are 
attempting to identify those young people who truly need educational 
assistance, then using out-dated census figures as the determiner 
may be worse than measuring mileage with a tablespoon. It's more 
like measuring baking soda with a speedometer. 



